Capstone 6940
Cox regression is a popular regression modeling method used to predict survival rates given certain covariates.
“Survival” can refer to the development of a symptom, time to relapse after remission, or as a time to death [1].
Cox regression model is based on the hazard function h_x(t) with covariates at time t given by:
h_x(t)=h_0(t)\exp(\beta_1x_1 +\beta_2x_2 + \dots + \beta_p x_n) h_x(t) = hazard \ function\ h_0(t) = Baseline \ hazard \ function\
\beta_1x_1 +\beta_2x_2 + \dots + \beta_p x_n = linear\ combination\ of \ covariates\ and\ their\ coefficients
The assumption of a constant relationship between dependent and explanatory variables is called proportional hazards [2].
The hazard function is the probability that an individual will experience an event (death) within a certain time interval [1].
If the risk factor is binary, the result can be interpreted as the estimated increase in HR in patients with the risk factor vs. those without [3].
If the risk factor is continuous, the HR is interpreted as an increase/decrease in the hazard rate of death due to a 1 unit increase in the variable [3].
Graphical strategies to assess proportionality assumption:
Kaplan-Meier Survival Distribution S(t): Plot (S(t)) as a function of survival time for each level covariate.
Plot the function (\log(-\log(S(t)))) as a function of the log survival time (log represent natural log).
Schoenfeld Residuals
Failing to meet the assumption of proportional hazards means that the effects between dependent and explanatory variables are not constant over time.
Time-varying covariates (coefficients) are used when a covariate changes over time during the follow-up period [4].
Internal time-varying coefficients are affected by survival status and include values that are generated by the subject [4].
External time-varying coefficients are pre-determined and external to the subject under study [4].
Data Source: Employee Turnover Data from Kaggle
| Variable | Description |
|---|---|
| stag | Experience (time) |
| event | Employee turnover |
| gender | Employee’s gender |
| age | Employee’s age (year) |
| industry | Employee’s industry |
| profession | Employee’s profession |
| traffic | From what pipelene employee came to the company. |
| coach | Presence of a coach (training) on probation |
| head_gender | Head (supervisor) gender |
| greywage | The salary does not seem to the tax authorities. |
| way | Employee’s way of transportation |
| extraversion | Extroversion score |
| independ | Independence score |
| selfcontrol | Self control score |
| anxiety | Anxiety score |
| novator | Innovator score |
{r, warning=FALSE, echo=F, message=FALSE}
library(readr)
turnover <- read_csv("turnover.csv")
View(turnover)
library(tidyverse)
library(dplyr)
library(corrplot)
library(RColorBrewer)
library(ggfortify)
library(riskRegression)
library(survival)
library(stringr)
library(zoo)
library(ranger)
library(ggplot2)
library(readxl)
library(MASS)
library(ADGofTest)
library(survminer)
library(car)
# Check for null and duplicate values
cat("Number of missing values :", sum(is.na(turnover)))
cat("Number of duplicates: ", turnover %>%
duplicated() %>%
sum())
# Remove duplicate values
turnover <- unique(turnover)
# Check distribution of data
turnover %>%
ggplot(aes(x = stag, color = factor(event),
fill = factor(event))) +
geom_histogram(aes(y = ..density..), alpha = 0.5) +
geom_density(alpha = 0.05) + scale_color_brewer(palette = "Dark2") +
scale_fill_brewer(palette = "Dark2") + theme_minimal() +
theme(legend.position = "top")
# Check for censored observations
n <- dim(turnover)[1]
cat((n - sum(turnover$event))/n * 100, "% of observations are censored")
# Separate variable based on type
NUM_COLS <- c("stag", "age", "extraversion", "independ", "selfcontrol", "anxiety", "novator")
CAT_COLS <- c("gender", "industry", "profession", "traffic", "coach", "head_gender", "greywage", "way")
# Transform CAT_COLS into categorical type
for (COL in CAT_COLS){
turnover[COL] <- turnover[COL] %>% unlist() %>% factor()}Number of missing values : 0
Number of duplicates: 13
49.82079 % of observations are censored
# Categorical covariates
# Selection of covariates (that are only discrete)
turnover.cat <- turnover %>%
select_if(is.factor) %>%
mutate(event = turnover$event)
turnover.cat %>% ggplot(aes(x = gender, color = factor(event),
fill = factor(event))) +
geom_bar(alpha = 0.5) +
scale_color_brewer(palette = "Dark2") + scale_fill_brewer(palette = "Dark2") +
theme_minimal() + theme(legend.position = "top")# Continuous covariates
turnover %>% ggplot(aes(x = stag, color = factor(event),
fill = factor(event))) +
geom_histogram(aes(y = ..density..), alpha = 0.5) +
geom_density(alpha = 0.05) +
scale_color_brewer(palette = "Dark2") + scale_fill_brewer(palette = "Dark2") +
theme_minimal() + theme(legend.position = "top")# Other encoding of discrete variables
turnover.num <- turnover.cat[-9]
levels(turnover.num$gender) <- 1:length(levels(turnover.cat$gender))
levels(turnover.num$industry) <- 1:length(levels(turnover.cat$industry))
levels(turnover.num$profession) <- 1:length(levels(turnover.cat$profession))
levels(turnover.num$traffic) <- 1:length(levels(turnover.cat$traffic))
levels(turnover.num$coach) <- 1:length(levels(turnover.cat$coach))
levels(turnover.num$head_gender) <- 1:length(levels(turnover.cat$head_gender))
levels(turnover.num$greywage) <- 1:length(levels(turnover.cat$greywage))
levels(turnover.num$way) <- 1:length(levels(turnover.cat$way))
turnover.num <- as.data.frame(apply(turnover.num, 2, as.numeric))
turnover.num <- cbind(turnover.num, turnover %>%
select_if(is.numeric))
# Correlation matrix
corrplot(cor(turnover.num), col = brewer.pal(10, "BrBG"), method = "square", diag = FALSE)Now we will plot our categorical variables (gender, industry, profession, traffic, coach, head_gender, greywage, and way) using the Kaplan-Meier Estimation Curve. The Kaplan-Meier allows us to visualize how each category of survival probability differs from the others.
Call:
coxph(formula = Surv(stag, event) ~ ., data = turnover)
n= 1116, number of events= 560
coef exp(coef) se(coef) z Pr(>|z|)
genderm -0.110567 0.895327 0.127236 -0.869 0.384855
age 0.021321 1.021550 0.006992 3.050 0.002292 **
industryBanks -0.228597 0.795649 0.370386 -0.617 0.537112
industryBuilding -0.228859 0.795441 0.394945 -0.579 0.562272
industryConsult -0.387826 0.678530 0.378031 -1.026 0.304934
industryetc -0.571840 0.564486 0.376067 -1.521 0.128365
industryHoReCa -0.641488 0.526508 0.545700 -1.176 0.239782
industryIT -1.188633 0.304637 0.392874 -3.025 0.002482 **
industrymanufacture -0.799654 0.449484 0.373800 -2.139 0.032415 *
industryMining -0.603336 0.546984 0.449444 -1.342 0.179465
industryPharma -1.005353 0.365916 0.480887 -2.091 0.036562 *
industryPowerGeneration -1.032475 0.356124 0.451676 -2.286 0.022261 *
industryRealEstate -1.725592 0.178068 0.588426 -2.933 0.003362 **
industryRetail -1.042228 0.352668 0.363732 -2.865 0.004165 **
industryState -0.667077 0.513206 0.410168 -1.626 0.103875
industryTelecom -1.186918 0.305160 0.448385 -2.647 0.008119 **
industrytransport -0.852135 0.426504 0.427996 -1.991 0.046482 *
professionBusinessDevelopment 0.600890 1.823741 0.508792 1.181 0.237597
professionCommercial 0.998866 2.715201 0.507452 1.968 0.049022 *
professionConsult 0.570631 1.769384 0.520478 1.096 0.272921
professionEngineer 0.998848 2.715151 0.538726 1.854 0.063726 .
professionetc 0.486204 1.626132 0.486495 0.999 0.317600
professionFinance 0.054707 1.056231 0.526353 0.104 0.917220
professionHR 0.202198 1.224091 0.429582 0.471 0.637865
professionIT 0.069725 1.072213 0.491667 0.142 0.887228
professionLaw 0.403392 1.496894 0.647872 0.623 0.533520
professionmanage 1.284135 3.611541 0.500159 2.567 0.010245 *
professionMarketing 0.726077 2.066956 0.482551 1.505 0.132411
professionPR 0.846313 2.331035 0.640073 1.322 0.186097
professionSales 0.505293 1.657472 0.467728 1.080 0.280002
professionTeaching 0.617169 1.853673 0.569441 1.084 0.278446
trafficempjs 0.928219 2.529999 0.314661 2.950 0.003179 **
trafficfriends 0.122161 1.129937 0.342264 0.357 0.721151
trafficKA 0.141661 1.152186 0.353556 0.401 0.688659
trafficrabrecNErab 0.548112 1.729984 0.310084 1.768 0.077124 .
trafficrecNErab -0.051263 0.950029 0.381447 -0.134 0.893095
trafficreferal 0.368137 1.445041 0.325262 1.132 0.257711
trafficyoujs 0.654608 1.924387 0.309447 2.115 0.034395 *
coachno 0.056040 1.057640 0.111493 0.503 0.615221
coachyes 0.211061 1.234988 0.151771 1.391 0.164329
head_genderm 0.055167 1.056717 0.102616 0.538 0.590846
greywagewhite -0.505741 0.603059 0.135383 -3.736 0.000187 ***
waycar -0.201625 0.817401 0.103831 -1.942 0.052153 .
wayfoot -0.402833 0.668424 0.174165 -2.313 0.020726 *
extraversion 0.016623 1.016761 0.035460 0.469 0.639236
independ -0.019380 0.980806 0.035771 -0.542 0.587962
selfcontrol -0.045743 0.955287 0.035878 -1.275 0.202321
anxiety -0.048800 0.952372 0.034633 -1.409 0.158827
novator 0.009108 1.009150 0.030631 0.297 0.766196
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
genderm 0.8953 1.1169 0.6977 1.1489
age 1.0216 0.9789 1.0076 1.0356
industryBanks 0.7956 1.2568 0.3850 1.6444
industryBuilding 0.7954 1.2572 0.3668 1.7250
industryConsult 0.6785 1.4738 0.3234 1.4235
industryetc 0.5645 1.7715 0.2701 1.1797
industryHoReCa 0.5265 1.8993 0.1807 1.5343
industryIT 0.3046 3.2826 0.1410 0.6580
industrymanufacture 0.4495 2.2248 0.2160 0.9352
industryMining 0.5470 1.8282 0.2267 1.3199
industryPharma 0.3659 2.7329 0.1426 0.9391
industryPowerGeneration 0.3561 2.8080 0.1469 0.8631
industryRealEstate 0.1781 5.6158 0.0562 0.5642
industryRetail 0.3527 2.8355 0.1729 0.7194
industryState 0.5132 1.9485 0.2297 1.1466
industryTelecom 0.3052 3.2770 0.1267 0.7348
industrytransport 0.4265 2.3446 0.1843 0.9868
professionBusinessDevelopment 1.8237 0.5483 0.6728 4.9436
professionCommercial 2.7152 0.3683 1.0043 7.3408
professionConsult 1.7694 0.5652 0.6380 4.9074
professionEngineer 2.7152 0.3683 0.9446 7.8047
professionetc 1.6261 0.6150 0.6267 4.2195
professionFinance 1.0562 0.9468 0.3765 2.9634
professionHR 1.2241 0.8169 0.5274 2.8410
professionIT 1.0722 0.9327 0.4090 2.8105
professionLaw 1.4969 0.6681 0.4205 5.3292
professionmanage 3.6115 0.2769 1.3551 9.6256
professionMarketing 2.0670 0.4838 0.8028 5.3221
professionPR 2.3310 0.4290 0.6648 8.1730
professionSales 1.6575 0.6033 0.6627 4.1455
professionTeaching 1.8537 0.5395 0.6072 5.6590
trafficempjs 2.5300 0.3953 1.3655 4.6877
trafficfriends 1.1299 0.8850 0.5777 2.2100
trafficKA 1.1522 0.8679 0.5762 2.3039
trafficrabrecNErab 1.7300 0.5780 0.9421 3.1768
trafficrecNErab 0.9500 1.0526 0.4498 2.0064
trafficreferal 1.4450 0.6920 0.7639 2.7337
trafficyoujs 1.9244 0.5196 1.0493 3.5293
coachno 1.0576 0.9455 0.8500 1.3160
coachyes 1.2350 0.8097 0.9172 1.6628
head_genderm 1.0567 0.9463 0.8642 1.2921
greywagewhite 0.6031 1.6582 0.4625 0.7863
waycar 0.8174 1.2234 0.6669 1.0019
wayfoot 0.6684 1.4961 0.4751 0.9404
extraversion 1.0168 0.9835 0.9485 1.0899
independ 0.9808 1.0196 0.9144 1.0520
selfcontrol 0.9553 1.0468 0.8904 1.0249
anxiety 0.9524 1.0500 0.8899 1.0193
novator 1.0092 0.9909 0.9503 1.0716
Concordance= 0.661 (se = 0.012 )
Likelihood ratio test= 173.8 on 49 df, p=7e-16
Wald test = 176.7 on 49 df, p=3e-16
Score (logrank) test = 183.5 on 49 df, p=<2e-16
We want to refine the model to use the independent variable(s) with the greatest influence from our crude model. To do this, we will use the “step” function in R which iteratively adds or removes variables based on a chosen criterion. For this model we will use the AIC value. The Akaike Information Criterion (AIC) is a statistical metric that evaluates how well a model fits a given set of data.
Start: AIC=6717.67
Surv(stag, event) ~ gender + age + industry + profession + traffic +
coach + head_gender + greywage + way + extraversion + independ +
selfcontrol + anxiety + novator
Df AIC
- coach 2 6715.6
- novator 1 6715.8
- extraversion 1 6715.9
- head_gender 1 6716.0
- independ 1 6716.0
- gender 1 6716.4
- selfcontrol 1 6717.3
- anxiety 1 6717.7
<none> 6717.7
- profession 14 6720.5
- way 2 6721.8
- age 1 6724.8
- greywage 1 6728.3
- traffic 7 6741.6
- industry 15 6746.9
Step: AIC=6715.58
Surv(stag, event) ~ gender + age + industry + profession + traffic +
head_gender + greywage + way + extraversion + independ +
selfcontrol + anxiety + novator
Df AIC
- novator 1 6713.8
- head_gender 1 6713.8
- extraversion 1 6713.8
- independ 1 6713.9
- gender 1 6714.3
- selfcontrol 1 6715.2
<none> 6715.6
- anxiety 1 6715.7
- profession 14 6718.8
- way 2 6719.2
- age 1 6722.8
- greywage 1 6726.4
- traffic 7 6739.4
- industry 15 6746.4
Step: AIC=6713.76
Surv(stag, event) ~ gender + age + industry + profession + traffic +
head_gender + greywage + way + extraversion + independ +
selfcontrol + anxiety
Df AIC
- extraversion 1 6712.1
- head_gender 1 6712.1
- independ 1 6712.1
- gender 1 6712.5
- anxiety 1 6713.7
<none> 6713.8
- selfcontrol 1 6714.5
- profession 14 6716.9
- way 2 6717.3
- age 1 6721.1
- greywage 1 6724.4
- traffic 7 6737.5
- industry 15 6744.5
Step: AIC=6712.05
Surv(stag, event) ~ gender + age + industry + profession + traffic +
head_gender + greywage + way + independ + selfcontrol + anxiety
Df AIC
- head_gender 1 6710.4
- gender 1 6710.8
- independ 1 6711.0
<none> 6712.1
- anxiety 1 6713.4
- profession 14 6715.5
- way 2 6715.6
- selfcontrol 1 6717.3
- age 1 6719.1
- greywage 1 6722.5
- traffic 7 6736.4
- industry 15 6743.3
Step: AIC=6710.35
Surv(stag, event) ~ gender + age + industry + profession + traffic +
greywage + way + independ + selfcontrol + anxiety
Df AIC
- gender 1 6709.0
- independ 1 6709.4
<none> 6710.4
- anxiety 1 6711.9
- way 2 6713.9
- profession 14 6714.1
- selfcontrol 1 6715.9
- age 1 6719.1
- greywage 1 6720.9
- traffic 7 6734.4
- industry 15 6741.9
Step: AIC=6709
Surv(stag, event) ~ age + industry + profession + traffic + greywage +
way + independ + selfcontrol + anxiety
Df AIC
- independ 1 6708.2
<none> 6709.0
- anxiety 1 6711.7
- profession 14 6712.1
- way 2 6712.8
- selfcontrol 1 6715.3
- age 1 6717.6
- greywage 1 6720.0
- traffic 7 6733.2
- industry 15 6740.7
Step: AIC=6708.17
Surv(stag, event) ~ age + industry + profession + traffic + greywage +
way + selfcontrol + anxiety
Df AIC
<none> 6708.2
- anxiety 1 6709.7
- profession 14 6710.9
- way 2 6712.0
- selfcontrol 1 6713.4
- age 1 6716.1
- greywage 1 6719.4
- traffic 7 6732.4
- industry 15 6739.6
Call:
coxph(formula = Surv(stag, event) ~ age + industry + profession +
traffic + greywage + way + selfcontrol + anxiety, data = turnover)
coef exp(coef) se(coef) z p
age 0.02055 1.02076 0.00641 3.205 0.001349
industryBanks -0.28975 0.74845 0.36289 -0.798 0.424614
industryBuilding -0.26582 0.76658 0.38880 -0.684 0.494175
industryConsult -0.45691 0.63324 0.36927 -1.237 0.215967
industryetc -0.64403 0.52517 0.36896 -1.746 0.080890
industryHoReCa -0.77805 0.45930 0.54017 -1.440 0.149758
industryIT -1.24814 0.28704 0.38553 -3.237 0.001206
industrymanufacture -0.87933 0.41506 0.36757 -2.392 0.016744
industryMining -0.65940 0.51716 0.44284 -1.489 0.136485
industryPharma -1.04233 0.35263 0.47232 -2.207 0.027324
industryPowerGeneration -1.09181 0.33561 0.44269 -2.466 0.013651
industryRealEstate -1.81850 0.16227 0.58090 -3.130 0.001745
industryRetail -1.10739 0.33042 0.35511 -3.118 0.001818
industryState -0.73784 0.47814 0.40157 -1.837 0.066154
industryTelecom -1.25500 0.28508 0.43786 -2.866 0.004154
industrytransport -0.86358 0.42165 0.42212 -2.046 0.040775
professionBusinessDevelopment 0.59092 1.80564 0.50153 1.178 0.238702
professionCommercial 1.00727 2.73813 0.49840 2.021 0.043279
professionConsult 0.54819 1.73012 0.50936 1.076 0.281828
professionEngineer 0.94501 2.57284 0.52648 1.795 0.072662
professionetc 0.45756 1.58021 0.48289 0.948 0.343360
professionFinance 0.05804 1.05976 0.51797 0.112 0.910784
professionHR 0.22086 1.24714 0.42463 0.520 0.602987
professionIT 0.02934 1.02978 0.47418 0.062 0.950654
professionLaw 0.31703 1.37304 0.64071 0.495 0.620736
professionmanage 1.28255 3.60582 0.49774 2.577 0.009974
professionMarketing 0.70287 2.01955 0.47822 1.470 0.141624
professionPR 0.82179 2.27458 0.63824 1.288 0.197889
professionSales 0.50708 1.66044 0.45661 1.111 0.266774
professionTeaching 0.62625 1.87058 0.56833 1.102 0.270502
trafficempjs 0.85644 2.35476 0.30998 2.763 0.005729
trafficfriends 0.03493 1.03555 0.33616 0.104 0.917244
trafficKA 0.10265 1.10810 0.34993 0.293 0.769274
trafficrabrecNErab 0.48122 1.61805 0.30601 1.573 0.115817
trafficrecNErab -0.12679 0.88092 0.37753 -0.336 0.736996
trafficreferal 0.30802 1.36073 0.32125 0.959 0.337642
trafficyoujs 0.60301 1.82761 0.30666 1.966 0.049257
greywagewhite -0.51396 0.59812 0.13401 -3.835 0.000125
waycar -0.21103 0.80975 0.10259 -2.057 0.039682
wayfoot -0.37776 0.68539 0.17333 -2.179 0.029301
selfcontrol -0.06082 0.94099 0.02269 -2.680 0.007353
anxiety -0.04924 0.95196 0.02619 -1.880 0.060130
Likelihood ratio test=169.3 on 42 df, p=< 2.2e-16
n= 1116, number of events= 560
Using these results, we can see that the model including independent variables age, industry, profession, traffic, greywage, selfcontrol, and anxiety produced the most accurate survival model. We will use these variables in our adjusted model.
Call:
coxph(formula = Surv(stag, event) ~ age + industry + profession +
traffic + greywage + way + selfcontrol + anxiety, data = turnover)
n= 1116, number of events= 560
coef exp(coef) se(coef) z Pr(>|z|)
age 0.02055 1.02076 0.00641 3.205 0.001349 **
industryBanks -0.28975 0.74845 0.36289 -0.798 0.424614
industryBuilding -0.26582 0.76658 0.38880 -0.684 0.494175
industryConsult -0.45691 0.63324 0.36927 -1.237 0.215967
industryetc -0.64403 0.52517 0.36896 -1.746 0.080890 .
industryHoReCa -0.77805 0.45930 0.54017 -1.440 0.149758
industryIT -1.24814 0.28704 0.38553 -3.237 0.001206 **
industrymanufacture -0.87933 0.41506 0.36757 -2.392 0.016744 *
industryMining -0.65940 0.51716 0.44284 -1.489 0.136485
industryPharma -1.04233 0.35263 0.47232 -2.207 0.027324 *
industryPowerGeneration -1.09181 0.33561 0.44269 -2.466 0.013651 *
industryRealEstate -1.81850 0.16227 0.58090 -3.130 0.001745 **
industryRetail -1.10739 0.33042 0.35511 -3.118 0.001818 **
industryState -0.73784 0.47814 0.40157 -1.837 0.066154 .
industryTelecom -1.25500 0.28508 0.43786 -2.866 0.004154 **
industrytransport -0.86358 0.42165 0.42212 -2.046 0.040775 *
professionBusinessDevelopment 0.59092 1.80564 0.50153 1.178 0.238702
professionCommercial 1.00727 2.73813 0.49840 2.021 0.043279 *
professionConsult 0.54819 1.73012 0.50936 1.076 0.281828
professionEngineer 0.94501 2.57284 0.52648 1.795 0.072662 .
professionetc 0.45756 1.58021 0.48289 0.948 0.343360
professionFinance 0.05804 1.05976 0.51797 0.112 0.910784
professionHR 0.22086 1.24714 0.42463 0.520 0.602987
professionIT 0.02934 1.02978 0.47418 0.062 0.950654
professionLaw 0.31703 1.37304 0.64071 0.495 0.620736
professionmanage 1.28255 3.60582 0.49774 2.577 0.009974 **
professionMarketing 0.70287 2.01955 0.47822 1.470 0.141624
professionPR 0.82179 2.27458 0.63824 1.288 0.197889
professionSales 0.50708 1.66044 0.45661 1.111 0.266774
professionTeaching 0.62625 1.87058 0.56833 1.102 0.270502
trafficempjs 0.85644 2.35476 0.30998 2.763 0.005729 **
trafficfriends 0.03493 1.03555 0.33616 0.104 0.917244
trafficKA 0.10265 1.10810 0.34993 0.293 0.769274
trafficrabrecNErab 0.48122 1.61805 0.30601 1.573 0.115817
trafficrecNErab -0.12679 0.88092 0.37753 -0.336 0.736996
trafficreferal 0.30802 1.36073 0.32125 0.959 0.337642
trafficyoujs 0.60301 1.82761 0.30666 1.966 0.049257 *
greywagewhite -0.51396 0.59812 0.13401 -3.835 0.000125 ***
waycar -0.21103 0.80975 0.10259 -2.057 0.039682 *
wayfoot -0.37776 0.68539 0.17333 -2.179 0.029301 *
selfcontrol -0.06082 0.94099 0.02269 -2.680 0.007353 **
anxiety -0.04924 0.95196 0.02619 -1.880 0.060130 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.0208 0.9797 1.00801 1.0337
industryBanks 0.7485 1.3361 0.36751 1.5243
industryBuilding 0.7666 1.3045 0.35777 1.6425
industryConsult 0.6332 1.5792 0.30707 1.3058
industryetc 0.5252 1.9041 0.25483 1.0823
industryHoReCa 0.4593 2.1772 0.15933 1.3240
industryIT 0.2870 3.4839 0.13483 0.6111
industrymanufacture 0.4151 2.4093 0.20195 0.8531
industryMining 0.5172 1.9336 0.21711 1.2319
industryPharma 0.3526 2.8358 0.13973 0.8899
industryPowerGeneration 0.3356 2.9797 0.14093 0.7992
industryRealEstate 0.1623 6.1626 0.05197 0.5066
industryRetail 0.3304 3.0264 0.16474 0.6627
industryState 0.4781 2.0914 0.21764 1.0505
industryTelecom 0.2851 3.5078 0.12085 0.6725
industrytransport 0.4217 2.3716 0.18435 0.9644
professionBusinessDevelopment 1.8056 0.5538 0.67567 4.8254
professionCommercial 2.7381 0.3652 1.03089 7.2727
professionConsult 1.7301 0.5780 0.63753 4.6951
professionEngineer 2.5728 0.3887 0.91679 7.2203
professionetc 1.5802 0.6328 0.61331 4.0715
professionFinance 1.0598 0.9436 0.38398 2.9249
professionHR 1.2471 0.8018 0.54258 2.8666
professionIT 1.0298 0.9711 0.40656 2.6084
professionLaw 1.3730 0.7283 0.39112 4.8201
professionmanage 3.6058 0.2773 1.35933 9.5650
professionMarketing 2.0195 0.4952 0.79103 5.1560
professionPR 2.2746 0.4396 0.65107 7.9465
professionSales 1.6604 0.6023 0.67850 4.0634
professionTeaching 1.8706 0.5346 0.61406 5.6983
trafficempjs 2.3548 0.4247 1.28261 4.3232
trafficfriends 1.0355 0.9657 0.53583 2.0013
trafficKA 1.1081 0.9024 0.55810 2.2001
trafficrabrecNErab 1.6181 0.6180 0.88821 2.9476
trafficrecNErab 0.8809 1.1352 0.42032 1.8463
trafficreferal 1.3607 0.7349 0.72498 2.5540
trafficyoujs 1.8276 0.5472 1.00196 3.3336
greywagewhite 0.5981 1.6719 0.45996 0.7778
waycar 0.8098 1.2349 0.66226 0.9901
wayfoot 0.6854 1.4590 0.48798 0.9627
selfcontrol 0.9410 1.0627 0.90005 0.9838
anxiety 0.9520 1.0505 0.90432 1.0021
Concordance= 0.66 (se = 0.012 )
Likelihood ratio test= 169.3 on 42 df, p=<2e-16
Wald test = 172.2 on 42 df, p=<2e-16
Score (logrank) test = 178.8 on 42 df, p=<2e-16
Hazard ratios above 1 indicate a higher likelihood of the individuals in that category to experience the event (termination) in the given timeframe.
age industryBanks
1.0207593 0.7484527
industryBuilding industryConsult
0.7665782 0.6332371
industryetc industryHoReCa
0.5251713 0.4593025
industryIT industrymanufacture
0.2870376 0.4150602
industryMining industryPharma
0.5171611 0.3526310
industryPowerGeneration industryRealEstate
0.3356074 0.1622689
industryRetail industryState
0.3304219 0.4781437
industryTelecom industrytransport
0.2850769 0.4216508
professionBusinessDevelopment professionCommercial
1.8056453 2.7381273
professionConsult professionEngineer
1.7301166 2.5728377
professionetc professionFinance
1.5802152 1.0597560
professionHR professionIT
1.2471440 1.0297797
professionLaw professionmanage
1.3730396 3.6058251
professionMarketing professionPR
2.0195459 2.2745773
professionSales professionTeaching
1.6604383 1.8705848
trafficempjs trafficfriends
2.3547649 1.0355461
trafficKA trafficrabrecNErab
1.1080976 1.6180539
trafficrecNErab trafficreferal
0.8809202 1.3607294
trafficyoujs greywagewhite
1.8276083 0.5981229
waycar wayfoot
0.8097502 0.6853927
selfcontrol anxiety
0.9409886 0.9519574
| Crude Model | Adjusted Model | |
|---|---|---|
| Covariate | Hazard ratio (95% CI), p-value | Hazard ratio (95% CI), p-value |
| genderm | (0.70, 1.15) 0.384 | Excluded |
| age | (1.01, 1.04) 0.002 | (1.01, 1.03) 0.001 |
| industryBanks | (0.39, 1.64) 0.537 | (0.37, 1.52) 0.424 |
| industryBuilding | (0.37, 1.73) 0.562 | (0.36, 1.64) 0.494 |
| industryConsult | (0.32, 1.42) 0.304 | (0.31, 1.31) 0.215 |
| industryetc | (0.27, 1.18) 0.128 | (0.25, 1.08) 0.080 |
| industryHoReCa | (0.18, 1.53) 0.239 | (0.16, 1.32) 0.149 |
| industryIT | (0.14, 0.66) 0.002 | (0.13, 0.61) 0.001 |
| industrymanufacture | (0.22, 0.94) 0.032 | (0.20, 0.85) 0.016 |
| industryMining | (0.23, 1.32) 0.179 | (0.21,1.23) 0.136 |
| industryPharma | (0.14, 0.94) 0.036 | (0.14, 0.89) 0.027 |
| industryPowerGeneration | (0.15, 0.86) 0.022 | (0.14, 0.80) 0.013 |
| industryRealEstate | (0.06, 0.56) 0.003 | (0.05, 0.51) 0.001 |
| industryRetail | (0.17, 0.72) 0.004 | (0.16, 0.66) 0.002 |
| industryState | (0.23,1.15) 0.104 | (0.22, 1.05) 0.066 |
| industryTelecom | (0.13, 0.73) 0.008 | (0.12, 0.67) 0.004 |
| industrytransport | (0.18, 0.99) 0.046 | (0.18, 0.96) 0.041 |
| professionBusinessDevelopment | (0.67, 4.94) 0.237 | (0.68, 4.83) 0.238 |
| professionCommercial | (1.00, 7.34) 0.049 | (1.03, 7.27) 0.043 |
| professionConsult | (0.64, 4.91) 0.272 | (0.64, 4.70) 0.281 |
| professionEngineer | (0.94, 7.80) 0.063 | (0.92, 7.22) 0.072 |
| professionetc | (0.63, 4.22) 0.317 | (0.61, 4.07) 0.343 |
| professionFinance | (0.377, 2.96) 0.917 | (0.38, 2.92) 0.910 |
| professionHR | (0.53, 2.84) 0.638 | (0.54, 2.87) 0.603 |
| professionIT | (0.41, 2.81) 0.887 | (0.41, 2.61) 0.951 |
| professionLaw | (0.42, 5.33) 0.533 | (0.39, 4.82) 0.621 |
| professionmanage | (1.36, 9.63) 0.010 | (1.36, 9.57) 0.01 |
| professionMarketing | (0.80, 5.32) 0.132 | (0.79, 5.16) 0.141 |
| professionPR | (0.66, 8.17) 0.186 | (0.65, 7.95) 0.197 |
| professionSales | (0.66, 4.15) 0.280 | (0.68, 4.06) 0.266 |
| professionTeaching | (0.61, 5.66) 0.278 | (0.61, 5.70) 0.270 |
| trafficempjs | (1.37, 4.69) 0.003 | (1.28, 4.32) 0.006 |
| trafficfriends | (0.58, 2.21) 0.721 | (0.54, 2.00) 0.917 |
| trafficKA | (0.58, 2.30) 0.688 | (0.56, 2.20) 0.769 |
| trafficrabrecNErab | (0.94, 3.18) 0.077 | (0.89, 2.95) 0.115 |
| trafficrecNErab | (0.45, 2.01) 0.893 | (0.42, 1.85) 0.736 |
| trafficreferal | (0.76, 2.73) 0.257 | (0.72, 2.55) 0.337 |
| trafficyoujs | (1.05, 3.53) 0.034 | (1.00, 3.33) 0.049 |
| coachno | (0.85,1.316) 0.615 | Excluded |
| coachyes | (0.92, 1.66) 0.164 | Excluded |
| head_genderm | (0.86, 1.29) 0.590 | Excluded |
| greywagewhite | (0.46, 0.79) 0.0002 | (0.46, 0.78) 0.0001 |
| waycar | (0.67, 1.00) 0.052 | (0.66, 0.99) 0.039 |
| wayfoot | (0.48, 0.94) 0.020 | (0.49, 0.96) 0.029 |
| extraversion | (0.95, 1.09) 0.639 | Excluded |
| independ | (0.91, 1.05) 0.587 | Excluded |
| selfcontrol | (0.89, 1.02) 0.202 | (0.90, 0.98) 0.007 |
| anxiety | (0.89, 1.02) 0.158 | (0.90, 1.00) 0.060 |
| novator | (0.95, 1.07) 0.766 | Excluded |
We will now use our adjusted model to predict 3-year survival rates between two employees, one age 25 and one age 55. All other variables will remain the same.
new_hire <- data.frame(age = c(25,55), industry = c("IT","IT"), stag= c(36, 36), event= c(0,0), profession = c("manage","manage"), traffic = c("empjs","empjs"), greywage = c("white","white"), way = c("car","car"), independ = c(5,5), selfcontrol = c(5,5), anxiety = c(5,5))
preds <- predict(model1, newdata = new_hire, type = "survival", se.fit = TRUE)
new_hire$prob <- preds$fit
new_hire$lcl <- preds$fit - 1.96*preds$se.fit
new_hire$ucl <- preds$fit + 1.96*preds$se.fit
new_hireOur results show a survival probability of 38% for the 25-year old, and 17% percent for the 55-year old.
By Hermela Shimelis
The colon cancer data is a built-in data set in the Survival R package [6]
Data set includes 929 subjects with stage B/C colon cancer who were randomized to three treatment groups then followed for 8-years.
The time to death or progression is given in days. The data set is filtered to evaluate time to death
Objective: Model the relationship between survival time and treatment
| Category | Variables |
|---|---|
| Treatment | Observation (no treatment) |
| Amisole (Lev) | |
| Amisole + 5-FU | |
| Patient Characteristics | Age |
| Sex | |
| Tumor Characteristics | Colon perforation and obstruction |
| Adherence to nearby organs | |
| Tumor differentiation | |
| Local spread | |
| Differentiation of tumor (well, moderate, poor) | |
| More than 4 positive lymph nodes | |
| Other | Time from surgery to registration |
| Observation (%) | Amisole (%) | Amisole + 5-FU (%) | ||
|---|---|---|---|---|
| N=315 | N=310 | N=304 | ||
| Demographics | ||||
| Male | 166 (52.3) | 177 (57.1) | 141 | |
| Median age (years) [IQR] | 60 [53,68] | 61 [53,69] | 61 [52,70] | |
| Cancer characteristics | ||||
| Colon obstruction | 63 (20.0) | 63 (20.3) | 54 (17.8) | |
| Colon perforation | 9 (2.9) | 10 (3.2) | 8 (2.6) | |
| Adherence to nearby organs | 47 (14.9) | 49 (15.8) | 39 (12.8) | |
| Differentiation of tumor | ||||
| Well | 27 (8.6) | 37 (11.9) | 29 (9.5) | |
| Moderate | 236 (74.9) | 229 (73.9) | 221 (72.7) | |
| Poor | 52 (16.5) | 44 (14.2) | 54 (17.8) | |
| Extent of local spread | ||||
| Contiguous | 20 (6.3) | 12 (3.9) | 11 (3.6) | |
| Muscle | 38 (12.1) | 36 (11.6) | 32 (10.5) | |
| Serosa | 249 (79.0) | 259 (83.5) | 251 (82.6) | |
| Submucosa | 8 (2.5) | 3 (1.0) | 10 (3.3) | |
| More than 4 lymph nodes with cancer | Yes | 87 (27.6) | 89 (28.7) | 79 (26.0) |
| Short time from surgery to registration (%) | Yes | 91 (28.9) | 80 (25.8) | 76 (25.0) |
| Characteristic | HR1 | 95% CI1 | p-value |
|---|---|---|---|
| rx | |||
| Obs | — | — | |
| Lev | 0.97 | 0.78, 1.21 | 0.8 |
| Lev+5FU | 0.69 | 0.55, 0.87 | 0.002 |
| Concordance = 0.536 | |||
| 1 HR = Hazard Ratio, CI = Confidence Interval | |||
| Characteristic | HR1 | 95% CI1 | p-value |
|---|---|---|---|
| rx | |||
| Obs | — | — | |
| Lev | 0.98 | 0.79, 1.22 | 0.9 |
| Lev+5FU | 0.69 | 0.54, 0.87 | 0.002 |
| age | 1.01 | 1.00, 1.02 | 0.083 |
| sex | 1.04 | 0.86, 1.26 | 0.7 |
| perfor | 1.00 | 0.59, 1.70 | >0.9 |
| adhere | 1.18 | 0.92, 1.53 | 0.2 |
| surg | 1.27 | 1.03, 1.55 | 0.022 |
| obstruct | 1.33 | 1.06, 1.68 | 0.015 |
| differentiation | |||
| moderate | — | — | |
| poor | 1.43 | 1.13, 1.82 | 0.003 |
| well | 1.08 | 0.78, 1.50 | 0.6 |
| node4 | 2.55 | 2.10, 3.09 | <0.001 |
| local_spread | |||
| contiguous | — | — | |
| muscle | 0.39 | 0.23, 0.64 | <0.001 |
| serosa | 0.64 | 0.43, 0.94 | 0.023 |
| submucosa | 0.29 | 0.10, 0.83 | 0.021 |
| Concordance = 0.674 | |||
| 1 HR = Hazard Ratio, CI = Confidence Interval | |||
| Characteristic | HR1 | 95% CI1 | p-value |
|---|---|---|---|
| rx | |||
| Obs | — | — | |
| Lev | 0.99 | 0.80, 1.23 | >0.9 |
| Lev+5FU | 0.69 | 0.54, 0.87 | 0.002 |
| age | 1.01 | 1.00, 1.02 | 0.069 |
| surg | 1.28 | 1.04, 1.56 | 0.018 |
| obstruct | 1.33 | 1.06, 1.67 | 0.015 |
| differentiation | |||
| moderate | — | — | |
| poor | 1.45 | 1.15, 1.84 | 0.002 |
| well | 1.07 | 0.77, 1.48 | 0.7 |
| node4 | 2.53 | 2.09, 3.07 | <0.001 |
| local_spread | |||
| contiguous | — | — | |
| muscle | 0.37 | 0.23, 0.61 | <0.001 |
| serosa | 0.61 | 0.41, 0.89 | 0.010 |
| submucosa | 0.27 | 0.09, 0.76 | 0.014 |
| Concordance = 0.672 | |||
| 1 HR = Hazard Ratio, CI = Confidence Interval | |||
| chisq | df | p | Variable | |
|---|---|---|---|---|
| rx | 2.335 | 2 | 0.311 | rx |
| age | 0.549 | 1 | 0.459 | age |
| surg | 0.020 | 1 | 0.888 | surg |
| obstruct | 6.148 | 1 | 0.013 | obstruct |
| differentiation | 17.459 | 2 | 0.000 | differentiation |
| node4 | 5.662 | 1 | 0.017 | node4 |
| local_spread | 7.083 | 3 | 0.069 | local_spread |
| GLOBAL | 37.525 | 11 | 0.000 | GLOBAL |
| Characteristic | HR1 | 95% CI1 | p-value |
|---|---|---|---|
| rx | |||
| Obs | — | — | |
| Lev | 0.98 | 0.79, 1.22 | 0.9 |
| Lev+5FU | 0.71 | 0.56, 0.89 | 0.003 |
| age | 1.01 | 1.00, 1.02 | 0.034 |
| surg | 1.30 | 1.06, 1.59 | 0.012 |
| node4 | 2.50 | 2.06, 3.04 | <0.001 |
| local_spread | |||
| contiguous | — | — | |
| muscle | 0.34 | 0.21, 0.56 | <0.001 |
| serosa | 0.58 | 0.39, 0.84 | 0.004 |
| submucosa | 0.24 | 0.08, 0.67 | 0.007 |
| Concordance = 0.674 | |||
| 1 HR = Hazard Ratio, CI = Confidence Interval | |||
| chisq | df | p | Variable | |
|---|---|---|---|---|
| rx | 2.001 | 2 | 0.368 | rx |
| age | 0.670 | 1 | 0.413 | age |
| surg | 0.014 | 1 | 0.905 | surg |
| node4 | 4.288 | 1 | 0.038 | node4 |
| local_spread | 5.298 | 3 | 0.151 | local_spread |
| GLOBAL | 12.411 | 8 | 0.134 | GLOBAL |
| Model | Description | AIC | BIC | C_Index |
|---|---|---|---|---|
| Model 1 | Base model | 5860.383 | 5860.383 | 0.500 |
| Model 2 | Treatment | 5852.236 | 5860.463 | 0.536 |
| Model 3 | Full variables | 5741.401 | 5798.993 | 0.674 |
| Model 4 | Stepwise-selected variables | 5737.261 | 5782.511 | 0.672 |
| Model 5 | Stratified | 4567.829 | 4600.739 | 0.674 |
Original c-index: 0.65
Mean cross-validated c-Index: 0.64
| id: | id |
| study: | 1 for all patients |
| rx: | Treatment - Obs(ervation), Lev(amisole), Lev(amisole)+5-FU |
| sex: | 1=male |
| age: | in years |
| obstruct: | obstruction of colon by tumour |
| perfor: | perforation of colon |
| adhere: | adherence to nearby organs |
| nodes: | number of lymph nodes with detectable cancer |
| time: | days until event or censoring |
| status: | censoring status |
| differ: | differentiation of tumour (1=well, 2=moderate, 3=poor) |
| extent: | Extent of local spread (1=submucosa, 2=muscle, 3=serosa, 4=contiguous structures) |
| surg: | time from surgery to registration (0=short, 1=long) |
| node4: | more than 4 positive lymph nodes |
| etype: | event type: 1=recurrence,2=death |